MNMST: topology of cell networks leverages identification of spatial domains from spatial transcriptomics data

Advances in spatial transcriptomics provide an unprecedented opportunity to reveal the structure and function of biology systems. However, current algorithms fail to address the heterogeneity and interpretability of spatial transcriptomics data. Here, we present a multi-layer network model for identifying spatial domains in spatial transcriptomics data with joint learning. We demonstrate that spatial domains can be precisely characterized and discriminated by the topological structure of cell networks, facilitating identification and interpretability of spatial domains, which outperforms state-of-the-art baselines. Furthermore, we prove that network model offers an effective and efficient strategy for integrative analysis of spatial transcriptomics data from various platforms. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03272-0.


Mathematical model for learning cell multi-layer networks
Here, we present the detailed description of the MNMST algorithm, which first constructs cell spatial neighbor graph  with cell spatial location by using k-nearest neighbor algorithm (by following Squidpy [1] , the number of neighbors k is 6 for 10 Genomics data, 8 for Stereo-seq and Slide-seq V2 data, and 15 for imaging-based platforms).In the cell spatial neighbor network  , weights on edges measure the similarity of cell pairs in spatial.Recently, evidence demonstrates that high-order topological structure is more precise to characterize patterns [2] .By following Ref.[3] , MNMST takes pointwise mutual information matrix (PMI) to construct cell spatial network  , and each element is defined as where  is the degree of -th cell in network  , and  is the number of non-negative samples (default set to 1).
On the construction of cell expression network  , given the low-dimensional cell expression feature matrix , it is difficult to select an appropriate manner to precisely characterize distances of cells.MNMST addresses this issue by employing sparse self-representation learning (SRL) [8] under the assumption that cells are well represented by their neighbors.Thus, the objective function is formulated as where ‖⋅‖ is  −norm, and constraint ensures no self-loop in the learned cell expression network  .One limitation of the above equation is that the learned network is dense, i.e., clique, where any pair of cells exist an edge.Actually, networks of cells are sparse.Thus, we ensure sparsity of  with  -norm constraint, i.e., min     , s.t.  0.
Moreover, we also expect cells proximal in the original feature space to have similar reconstructive coefficients.In other words, we expect the cell expression network  to preserve the intrinsic geometrical structure of cell expression matrix .The locality preserving projections [9] shows that preservation of topological structure in X can be expressed as trace optimization.i.e., where  ⋅ denotes matrix trace,  is the Laplacian matrix of graph  , and  . is the -th column of matrix  .By combing Eq. ( 2) and Eq. ( 3), the objective function for cell expression network construction is formulated as follows: where  and  are regularization parameters that balance the relative importance of the reconstruction error term, sparse term, and trace optimization term, respectively.

Mathematical model for learning spatial and transcriptional feature of cells
MNMST learns the spatial and transcriptional feature of cells by exploiting cell multi-layer networks.To obtain the relations between different layers of multi-layer networks, MNMST adopts joint nonnegative matrix factorization [4] to simultaneously decompose the cell multi-layer network, i.e., where matrix  is the common basis matrix with the shared features across different layers of the multi-layer network.
Then, MNSMT aims to learn the structural relations of cells, i.e., affinity graph, by exploiting the shared features of cells.Analogously, MNMST also adopts SRL to learn the affinity graph, which is formulated as: where  denotes the coefficient matrix, and  is the error term.To capture the global structure of the network, a low-rank constraint is imposed on the coefficient matrix .Eq. ( 6) is transformed into the regularized rank minimization problem as: The Eq. ( 7) is difficult to solve because of the low-rank constraint.Following [6] , MNMST adopts nuclear norm as an alternative.In this way, Eq. ( 7) is re-formulated as: where ‖⋅‖ * and ‖⋅‖ , are the nuclear norm and  , -norm, i.e., ‖‖ , ∑ ∑  .
By combing Eq. ( 5) and Eq. ( 8), the objective of affinity graph learning is: where  and  are parameters controlling the relative importance of the low-rank constraint and the error term.

Optimization of cell expression network construction
Since the objective function in Eq. ( 4) consists of a differentiable portion      ′ and a non-differentiable portion  , we use an alternating optimization approach to update one variable by fixing the others, and algorithm continues until the convergence criterion is met.
By introducing auxiliary variables  , Eq. ( 4) is transformed into the equivalent form, i.e., The augmented Lagrange function of Eq. ( 10) is defined as where  0 is the positive penalty scalar,  is the Lagrange multiplier, and ⋅,⋅ denotes the inner product of matrices.
According to ADMM (Alternating Direction Method of Multipliers) algorithm [5] , problem in Eq. ( 11) can be solved with the following sub-problems, i.e., Sub-problem  can be approximated with Eq. (12).By setting the partial derivative ℒ =0, the update rule for  is formulated as where e represents element-wise multiplication.

By fixing
[ ] e W , sub-problem  can be approximated by the above iteration, then where  is a soft threshold function defined as ℎ.
These equations lead to the following updating rules as

Optimization of learning cell spatial and transcriptional features
The problem in Eq. ( 9) is also solved by the alternating direction method of multipliers.By introducing the auxiliary variables  and H, Eq. ( 9) is equivalent to the following problem, i.e., The augmented Lagrange function of Eq. ( 16) is formulated as where  0 is the positive penalty scalar,  ,  ,   are the Lagrange multipliers, ⋅,⋅ denotes the inner product of matrices.MNMST updates one variable while fixing the others, which continues until the convergence criterion is reached.We exploit this special structure and optimize the objective by alternatingly updating the variable sets ,  ,  , E, Z and H.
According to ADMM (Alternating Direction Method of Multipliers) algorithm [5] , problem in Eq. ( 17) can be solved with the following sub-problems, i.e., Sub-problem B: by setting the partial derivative B    to 0, the update rule for B is deduced as Analogously, the update rule for  and  is formulated as Sub-problem E : let     / , the optimization of E is reformulated as The above problem can be solved as Eq. ( 24) according to Ref. [6] as Sub-problem  can be approximated as which can be solved by using a singular value thresholding operator according to Ref. [7] .

Identification of spatial domains
After obtaining the matrix Z, the affinity graph is constructed as || | | /2 since Z is unnecessarily symmetric.The Leiden algorithm [11] is deployed on the affinity graph to obtain spatial domains.The procedure of MNMST is described in Algorithm 1.

Algorithm analysis
On the time complexity issue, time for updating ,  and  requires    , where  is the number of iterations.Time for updating  and  is    , where  is the number of dimensions.Time for  ,  and B is    .The complexity for updating  is   .
Therefore, the time complexity of MNMST is    .
On the space complexity issue, MNMST constructs the cell spatial and expression network, requiring space  2 .The space for feature matrix is   , and space for cell affinity graph is   .Therefore, the overall space complexity of MNMST is   .

Running time of MNMST and acceleration strategy
For a fair comparison, all algorithms are executed on the same platform with Intel i9 CPU, 64GB RAM, and an NVIDIA RTX4090.
From Fig. S4 A1 and A2, it is easy to assert that: First, Gitto and stLearn are time-and spaceconsuming, hampering the applications of identification of spatial domains in ST data.And, DeepST reduces running time by sacrificing space, whereas BayesSpace reduces space complexity by sacrificing running time.Second, SCANPY, SEDR and SpaGCN are efficient in terms of running time and space, whereas performance of these is undesirable.In other words, these algorithms improve efficiency by sacrificing accuracy of algorithms.Third, MNMST reaches a good tradeoff between space and running time.Furthermore, it achieves the best performance on the identification of spatial domains, demonstrating that MNMST provides an excellent alternative for current algorithms.
On the MERFISH data [19] , we validate efficiency of algorithms by increasing the number of spots/cells from 1,000 to 30,000.Since SEDR、SpaGCN, and DeepST employ graph convolutional network (GCN), and DeepST is superior to them in terms of accuracy.Therefore, only SCANPY, DeepST and MNMST are selected for a comparison.

Parameter effect for spatial transcriptomics data generated with various platforms
By applying MNMST to spatial transcriptomics data generated by different platform, we investigate how ARI of MNMST changes as values of parameters vary as shown in Fig. S5, where panel A is for 10 Genomics, and B for osmFISH [20] , and C for STARmap [21] data, respectively.Fig. S5 A depicts how ARI of MNMST changes as parameters ,  increase from 0.1 to 80 for DLPFC data, where MNMST is quit stable.In details, as parameter  increases from 0.1 to 1, ARI of MNMST dramatically improves.And, MNMST is quit stable when  ∈ [1,80].
There are several possible reasons to explain this tendency.When  is too small, the contribution of structure constraint for cell affinity graph is limited, thereby reducing performance of MNMST.Increasing value of parameter  ensures structure of spatial domains in affinity graph, enhancing performance of MNMST.Furthermore, when parameter  is too large, MNMST overemphasizes sparsity of cell affinity graph, resulting in unbalance between structure of cell affinity network and spatial domains.Construct the cell topology network by using spatial proximity and morphological similarity; 2.
Construct the cell expression network with Eq. ( 4); Part II: Initialization

5.
Update B according to Eq. ( Update H according to Eq. ( Update Z according to Eq. ( Update E according to Eq. (24);

Supplementary Figures
Fig.S4B1 and B2 describe the running time and space of algorithms with various sizes of data respectively, where missing bar represents failure of algorithms to address the corresponding data.When the number of spots is less than or equal to 20,000, MNMST is worse than DeepST on both space and time.However, MNMST addresses ST data with more than 20,000 spots that DeepST fails to handle.The reason why DeepST fails to address large-scale ST data is that it employs deep learning to obtain features of spots/cells, where the number of parameters for learning exponentially increases, resulting in the expensive time complexity.

Fig. S3
Fig. S3 Spatial domain identification for mouse brain tissue.(A) and (B) Spatial domains identified by MNMST, DeepST, SEDR, stLearn, and SpaGCN in mouse brain posterior and coronal, respectively.(C) Histograms of Silhouette Coefficient (SC) and Davies-Bouldin (DB) scores for spatial domains identified by various algorithms for mouse brain posterior and coronal data.

Fig. S4
Fig. S4 Running time and space of algorithms for different spatial transcriptomics data: Distributions of running time (minutes) (A1) and space (Gigabyte) (A2) of algorithms on the DLPFC dataset, Running time (minutes) (B1) and space (Gigabyte) (B2) of MNMST with CPU and GPU on the DLPFC dataset, and Running time (minutes) (C1) and space (Gigabyte) (C2) of different algorithms on MERFISH data, respectively, where missing bar represents failure of algorithms to address the corresponding data.

Fig. S5
Fig. S5 Parameter effects of MNMST for various spatial transcriptomics data: (A) ARI vs parameter ,  for 10 × Genomics data, (B) ARI vs parameter ,  for osmFISH data, and (C) ARI vs parameter ,  for STARmap data.

Fig. S6
Fig. S6 Spatial domains identified by various algorithms in horizontally aligned mouse brain

Fig. S7
Fig. S7 Performance of MNMST on vertical integration of spatial transcriptomics data.(A)Spatial domains identified by MNMST for slices 151673, 151674, 151675, and 151676 in DLPFC data before and after integration.(B) Performance of DeepST and MNMST on the 3D coordinates of MERFISH data.(C) Performance of DeepST and MNMST on the center slice generated by PASTE.(D) Performance of MNMST and DeepST for mouse brain slice S3 in terms of ARI, where the manual annotation of slide S3 as ground truth.

Fig. S8
Fig. S8 Performance of MNMST on removing batch effects of spatial transcriptomics data.(A) UMAP plots of spatial integrated algorithms, including MNMST, DeepST, stLearn and SEDR, for slices 151673, 151674, 151675, and 151676 in DLPFC data.They represent batches, recognition spatial domains, and ground truth labels, respectively.(B) UMAP plots of the S1, S3 and stacked slice before integration.UMAP plots of spatial domains identified by MNMST for slice S1 and S3 in mouse breast cancer data.(C) UMAP plots of the S1, S2 and stacked slice after integration.UMAP plots of spatial domains identified by MNMST for slice S1 and S3 in mouse breast cancer data.

Fig. S9
Fig. S9 Performance of various algorithms for human breast cancer data.(A) H&E image (left)

Fig. S10
Fig. S10 Differential expressed genes of spatial domains identified by MNMST.(A) Heatmap of

Fig. S11
Fig. S11 Differential expressed genes of spatial domains identified by MNMST associated with

Fig. S12
Fig. S12 Differential expression analysis between domain 1 and 0 on human breast data.(A) Spatial

Fig. S13
Fig. S13 Differential expression analysis between domain 2 and 0 on human breast data.(A) Spatial